Integrated Scoring For Spelling Error Correction, Abbreviation Expansion and Case Restoration in Dirty Text
Abstract
An increasing number of language and speech applications are gearing towards the use of texts from online sources as input. Despite this rise, little work exists on integrated approaches for cleaning dirty texts from online sources. This paper presents a mechanism of Integrated Scoring for Spelling error correction, Abbreviation expansion and Case restoration (ISSAC). ISSAC was first conceived as part of the text preprocessing phase in an ontology engineering project. Evaluations of ISSAC using 400 chat records show an accuracy of 96.5%, an improvement over the 74.4% obtained using Aspell alone.
Similar resources
Enhanced Integrated Scoring for Cleaning Dirty Texts
An increasing number of approaches for ontology engineering from text are gearing towards the use of online sources such as company intranets and the World Wide Web. Despite this rise, little work exists on preprocessing and cleaning dirty texts from online sources. This paper presents an enhancement of an Integrated Scoring for Spelling error correction, Abbreviation expansio...
Design and implementation of Persian spelling detection and correction system based on Semantic
The Persian language has special features (grapheme, homophone, and multi-shape clinging characters) in electronic devices. Furthermore, designing and implementing NLP tools for Persian is more challenging than for other languages (e.g. English or German). Spelling tools are widely used for editing user texts such as emails and text in editors. Also developing Persian tools will provide Persian progr...
Three-Phase Text Error Correction Model for Korean SMS Messages
In this paper, we propose a three-phase text error correction model consisting of a word spacing error correction phase, a syllable-based spelling error correction phase, and a word-based spelling error correction phase. In order to reduce the text error correction complexity, the proposed model corrects text errors step by step. With the aim of correcting word spacing errors, spelling errors, a...
Candidate Scoring Using Web-Based Measure for Chinese Spelling Error Correction
Chinese character correction involves two major steps: 1) Providing candidate corrections for all or partially identified characters in a sentence, and 2) Scoring all altered sentences and identifying which is the best corrected sentence. In this paper a web-based measure is used to score candidate sentences, in which there exists one continuous error character in a sentence in almost all sente...
A Cascaded Approach for Social Media Text Normalization of Turkish
Text normalization is an indispensable stage for natural language processing of social media data with available NLP tools. We divide the normalization problem into 7 categories, namely: letter case transformation, replacement rules & lexicon lookup, proper noun detection, deasciification, vowel restoration, accent normalization and spelling correction. We propose a cascaded approach where each...